We will work with the data frame flights, which is
included in the nycflights13 package. To get started, load
tidyverse and nycflights13.
#install.packages("nycflights13")
library(tidyverse)
library(nycflights13)
You may need to install nycflights13 first. Run
install.packages("nycflights13") in your RStudio Console
pane.
Package nycflights13 contains a data frame
flights that has on-time data for all flights that departed
NYC (i.e., JFK, LGA, or EWR) in 2013. Take a few minutes to examine the
variables in flights and their descriptions.
Run ?flights in your RStudio Console pane.
?flights
Object flights is a tibble. Another way to view the
tibble in order to see all variables is with function
glimpse().
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Before you get started, take a few minutes to review R’s comparison and logical operators, detailed below.
| Operator | Description |
|---|---|
| `>` | greater than |
| `<` | less than |
| `>=` | greater than or equal to |
| `<=` | less than or equal to |
| `==` | equal to |
| `!=` | not equal to |
| `&` | and (ex: `(5 > 7) & (6*7 == 42)` will return the value `FALSE`) |
| `\|` | or (ex: `(5 > 7) \| (6*7 == 42)` will return the value `TRUE`) |
| `%in%` | group membership |
To evaluate group membership:
# Generating the group:
set.seed(634789234)
die.out <- sample(x = 1:6, size = 10, replace = TRUE)
die.out
# Checking for group membership:
die.out %in% c(3, 4)
c(3, 4) %in% die.out
die.out %in% c(1)
c(1) %in% die.out
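The operators in the table above work element by element on vectors, which is what makes them useful inside filter(). The following is a minimal sketch in base R; the delays vector and its values are made up for illustration and are not taken from flights.

```r
# A small, made-up vector of delays in minutes (hypothetical values).
delays <- c(-5, 0, 12, 45, NA)

# Comparisons are vectorized: one TRUE/FALSE (or NA) per element.
late <- delays > 0
late             # FALSE FALSE  TRUE  TRUE    NA

# & and | combine logical vectors element by element.
moderately_late <- (delays > 0) & (delays <= 30)
moderately_late  # FALSE FALSE  TRUE FALSE    NA

# %in% does not return NA: a missing value is simply not in the set.
in_set <- delays %in% c(0, 12)
in_set           # FALSE  TRUE  TRUE FALSE FALSE
```

Note that comparisons with NA yield NA, while %in% yields FALSE. This matters later, because filter() keeps only rows where the condition is TRUE.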
Package dplyr is based on the concept of functions as verbs that manipulate data frames.
| Function | Action and purpose |
|---|---|
| `filter()` | choose rows matching a set of criteria |
| `slice()` | choose rows using indices |
| `select()` | choose columns by name |
| `pull()` | grab a column as a vector |
| `rename()` | rename specific columns |
| `arrange()` | reorder rows |
| `mutate()` | add new variables to the data frame |
| `transmute()` | create a new data frame with new variables |
| `distinct()` | filter for unique rows |
| `sample_n()` / `sample_frac()` | randomly sample rows |
| `summarise()` | reduce variables to values |
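As a rough illustration of how these verbs chain together, here is a minimal sketch on a tiny data frame. The toy tibble and its values are hypothetical and only borrow column names from flights; the sketch assumes dplyr is installed.

```r
library(dplyr)

# A tiny, made-up data frame (hypothetical values, not taken from flights).
toy <- tibble(
  carrier   = c("UA", "AA", "UA", "DL"),
  dest      = c("ORD", "MIA", "DTW", "ATL"),
  dep_delay = c(4, -2, 15, 0)
)

# Each verb takes a data frame and returns a data frame, so they chain.
late_ua <- toy %>%
  filter(carrier == "UA", dep_delay > 0) %>%  # keep matching rows
  select(dest, dep_delay) %>%                 # keep two columns
  arrange(desc(dep_delay))                    # largest delay first

late_ua

# pull() is the exception: it returns a plain vector, not a data frame.
late_dests <- late_ua %>% pull(dest)
late_dests  # "DTW" "ORD"
```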
Make use of the %>% operator and any of the functions in
package dplyr to answer the following questions.
Filter flights for those in January with a destination
of Detroit Metro (DTW) or Chicago O’Hare (ORD).
flights %>%
  filter(month == 1 & dest %in% c('DTW', 'ORD'))
Filter flights for those before April with a destination
that is not Detroit Metro (DTW) and an origin of JFK.
flights %>%
  filter(month < 4 & dest != 'DTW' & origin == 'JFK')
Choose rows 1, 3, 7, 20 from flights.
flights %>%
  slice(1, 3, 7, 20)
Arrange flights by distance and then by departure delay, with the
sorting being in descending order in both cases. Hint:
desc()
flights %>%
  arrange(desc(distance), desc(dep_delay))
Select only columns month, origin, and destination from
flights.
flights %>%
  select(month, origin, dest)
Add a new variable to flights called gain,
where gain is the arrival delay minus the departure
delay.
flights %>%
  mutate(gain = arr_delay - dep_delay)
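Some flights have missing delay values, so gain is NA for those rows. The sketch below, on a made-up three-row tibble (hypothetical values), suggests one way to see this: arithmetic inside mutate() propagates NA, and missing values are dropped only at the summary step, where functions such as mean() accept na.rm.

```r
library(dplyr)

# Made-up delays; the last row mimics a flight with missing values.
toy <- tibble(
  dep_delay = c(2, 10, NA),
  arr_delay = c(11, 4, NA)
)

# Subtraction works row by row and returns NA when either input is NA.
with_gain <- toy %>%
  mutate(gain = arr_delay - dep_delay)

with_gain$gain  # 9 -6 NA

# na.rm belongs to the summary function, not to mutate().
mean_gain <- with_gain %>%
  summarise(mean_gain = mean(gain, na.rm = TRUE)) %>%
  pull(mean_gain)

mean_gain  # 1.5
```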
Use summarise to obtain the mean departure delay and mean arrival delay for all flights with an origin of EWR.
flights %>%
  filter(origin == 'EWR') %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE),
            mean_arr_delay = mean(arr_delay, na.rm = TRUE))
Grouping adds substantially to the power of the dplyr
functions. We will focus on using summarise() with
group_by(), but grouping can also be used with other
dplyr functions.
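To illustrate grouping with verbs other than summarise(), here is a minimal sketch using group_by() with mutate() and filter(). The toy tibble and its values are hypothetical; only the column names echo flights.

```r
library(dplyr)

# Made-up arrival delays for three origins (hypothetical values).
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  arr_delay = c(10, 30, -5, 5, 20)
)

# Grouped mutate(): every row is kept, but mean() is computed per origin.
with_group_mean <- toy %>%
  group_by(origin) %>%
  mutate(origin_mean = mean(arr_delay)) %>%
  ungroup()

with_group_mean

# Grouped filter(): keep the row with the largest delay within each origin.
worst_per_origin <- toy %>%
  group_by(origin) %>%
  filter(arr_delay == max(arr_delay)) %>%
  ungroup()

worst_per_origin
```

Calling ungroup() at the end is a design choice here: it returns an ordinary tibble so that later steps do not silently operate per group.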
Create a data frame that contains the number of flights and the mean arrival delay for flights on carrier UA (United Airlines) whose destination is O’Hare Airport (ORD). The number of flights and the mean arrival delay are calculated separately for each origin airport.
ua_ord_summary <- flights %>%
  filter(carrier == "UA", dest == "ORD") %>%
  group_by(origin) %>%
  summarise(
    n_flights = n(),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  )
ua_ord_summary
Create a data frame that contains the mean number of flight hours for carrier UA (United Airlines) flights originating from Newark Liberty International Airport (EWR) to each unique destination. Arrange the data in descending order of mean flight hours.
ua_ewr_hours <- flights %>%
  filter(carrier == "UA", origin == "EWR") %>%
  mutate(flight_hours = air_time / 60) %>%
  group_by(dest) %>%
  summarise(
    mean_flight_hours = mean(flight_hours, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_flight_hours))
ua_ewr_hours